Significantly Lower Entropy Estimates for Natural DNA Sequences

Authors

  • David Loewenstern
  • Peter N. Yianilos
Abstract

If DNA were a random string over its alphabet {A, C, G, T}, an optimal code would assign 2 bits to each nucleotide. DNA may be imagined to be a highly ordered, purposeful molecule, and one might therefore reasonably expect statistical models of its string representation to produce much lower entropy estimates. Surprisingly, this has not been the case for many natural DNA sequences, including portions of the human genome. We introduce a new statistical model (compression algorithm), the strongest reported to date, for naturally occurring DNA sequences. Conventional techniques code a nucleotide using only slightly fewer bits (1.90) than one obtains by relying only on the frequency statistics of individual nucleotides (1.95). Our method in some cases increases this gap by more than five-fold (1.66) and may lead to better performance in microbiological pattern recognition applications. One of our main contributions, and the principal source of these improvements, is the formal inclusion of inexact match information in the model. The existence of matches at various distances forms a panel of experts, which are then combined into a single prediction. The structure of this combination is novel and its parameters are learned using Expectation Maximization (EM). Experiments are reported using a wide variety of DNA sequences and compared whenever possible with earlier work. Four reasonable notions for the string distance function used to identify near matches are implemented and experimentally compared. We also report lower entropy estimates for coding regions extracted from a large collection of non-redundant human genes. The conventional estimate is 1.92 bits. Our model produces only slightly better results (1.91 bits) when considering nucleotides, but achieves 1.84-1.87 bits when the prediction problem is divided into two stages: i) predict the next amino acid based on inexact polypeptide matches, and ii) predict the particular codon.
Our results suggest that matches at the amino acid level play some role, but a small one, in determining the statistical structure of non-redundant coding sequences.
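
The baseline figures in the abstract can be illustrated with a minimal sketch. It is not the authors' model: it only computes the zeroth-order (single-nucleotide frequency) entropy that the abstract uses as its reference point, and then reproduces the "more than five-fold" gap arithmetic from the quoted numbers. The function name `order0_entropy_bits` and the toy sequences are assumptions made for illustration.

```python
import math
from collections import Counter


def order0_entropy_bits(seq: str) -> float:
    """Empirical zeroth-order entropy of seq, in bits per symbol.

    Hypothetical helper: uses only individual symbol frequencies,
    with no context and no match information.
    """
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# A sequence with equal counts of A, C, G, T costs 2 bits per nucleotide.
uniform = order0_entropy_bits("ACGT" * 100)

# A skewed toy sequence (made up, not real DNA) falls below 2 bits.
skewed = order0_entropy_bits("A" * 40 + "C" * 10 + "G" * 10 + "T" * 40)

# Gap arithmetic using the figures quoted in the abstract:
# frequency baseline 1.95, conventional 1.90, proposed model 1.66.
baseline, conventional, proposed = 1.95, 1.90, 1.66
gap_ratio = (baseline - proposed) / (baseline - conventional)

print(f"uniform: {uniform:.2f} bits, skewed: {skewed:.2f} bits")
print(f"gap ratio: {gap_ratio:.1f}")
```

The ratio compares the improvement over the frequency baseline (1.95 - 1.66 = 0.29 bits) with the improvement achieved by conventional techniques (1.95 - 1.90 = 0.05 bits), which is how the quoted figures yield a gap more than five times larger.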


Similar resources

Using Markov Models and Hidden Markov Models to Find Repetitive Extragenic Palindromic Sequences in Escherichia coli

This paper presents a technique for using simple Markov models and hidden Markov models (HMMs) to search for interesting sequences in a database of DNA sequences. The models are used to create a cost map for each sequence in the database. These cost maps can be searched rapidly for subsequences that have significantly lower costs than a null model. Milosavljević's algorithmic significance test i...


Tychoparthenogenesis and mixed mating in natural populations of the mayfly Stenonema femoratum

Tychoparthenogenesis is a breeding system characterized by low population mean hatching success (usually <10%) of unfertilized eggs from females of typically sexually reproducing species. I used progeny-array analysis to estimate outcrossing and parthenogenetic rates for two tychoparthenogenetic populations of the mayfly, Stenonema femoratum. Based on multilocus outcrossing rate estimates (tm), ...


Upper and Lower Bounds of Inequality of Opportunity: Theory and Evidence for Germany and the US

Theories of distributive justice distinguish between ethically acceptable inequalities, due to differences in effort, and unfair inequalities, due to circumstances beyond the sphere of individual responsibility. In this paper, we suggest a new estimator of inequality of opportunity (IOp) which allows identifying an upper bound for unfair inequalities in addition to the well-known lower bound esti...


Determination of Yield Bounds Prior to Routing

Integrated circuit manufacturing complexities have resulted in decreasing product yields and reliabilities. This process has been accelerated with the advent of very deep sub-micron technologies coupled with the introduction of newer materials and technologies like copper interconnects, silicon-on-insulator and increased wafer sizes. The need to improve product yields has been recognized and cur...




Publication date: 1996